RidgeRun NVIDIA PVA Development Algorithms
PVA Algorithms from LibPVA
RidgeRun has implemented the following image processing algorithms on the PVA. These are foundational for image signal processing (ISP) pipelines and optimized for high efficiency.
Get access to the FREE evaluation version of the PVA sample binaries here:
All measurements were taken under the following conditions:
- Platform: Jetson AGX Orin 32GB
- OS: JetPack 6.2
- Power Profile: MAXN power mode + Jetson Clocks
- CPU: All measurements use aggressive compiler optimization flags and OpenMP. Introducing NEON might halve the execution times.
- PVA: All measurements use a single PVA (with two VPU cores)
- Power Measurements: using jetson-stats (a tool based on tegrastats) with a VDD_CPU_CV power meter probe.
The profiling details are:
- Execution time CPU (ms): measured on a single ARM core.
- Execution time PVA (ms): measured on one PVA, using its two VPU cores (DSP slices).
- Power Consumption CPU only (W): measured with as many CPU cores as needed for the CPU execution time to nearly match the PVA (iso-performance).
- Power Consumption PVA only (W): measured on one PVA, using its two VPU cores.
Power consumption was measured at the whole-platform level using the jetson-stats Python library.
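The ratio columns in the tables on this page follow directly from these measurements. As a minimal sketch (the helper names are hypothetical and not part of LibPVA), the performance ratio divides the CPU execution time by the PVA execution time, and the power ratio divides the CPU-only power by the PVA power. The sample values are taken from the 1280x720 row of the Bit Shifting table.

```python
def performance_ratio(cpu_ms: float, pva_ms: float) -> float:
    """Speedup of the PVA over the CPU: CPU time divided by PVA time."""
    return cpu_ms / pva_ms


def power_ratio(cpu_watts: float, pva_watts: float) -> float:
    """Power saving of the PVA: CPU-only power divided by PVA power."""
    return cpu_watts / pva_watts


# Values from the 1280x720 row of the Bit Shifting table.
speedup = performance_ratio(0.309, 0.04865)
saving = power_ratio(8.75, 3.21)
print(f"{speedup:.2f}x faster, {saving:.2f}x less power")
# prints "6.35x faster, 2.73x less power"
```

Note that the power figure for the CPU is taken at iso-performance, so the two ratios describe different operating points: the speedup compares one ARM core against one PVA, while the power ratio compares several ARM cores against one PVA.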
Bit Shifting (Debayering Resolution Downscaling)
This technique reduces the bit depth (sample resolution) of the image by right-shifting pixel values during debayering. It is useful for saving bandwidth or for matching the bit depth that downstream stages expect.
Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized implementation of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. A shift of 10 bits was used for the benchmarks. Performance measurements can also be observed in the attached graph.
Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Power consumption CPU only (W) | Power consumption PVA (W) | Power Ratio (CPU/PVA) |
---|---|---|---|---|---|---|
1280x720 | 0.309 | 0.04865 | 6.35x | 8.75 | 3.21 | 2.73x |
1920x1080 | 0.675 | 0.10678 | 6.32x | 9.14 | 3.27 | 2.80x |
3840x2160 | 2.51 | 0.4061 | 6.18x | 9.54 | 3.24 | 2.94x |

This operation downscales a single-channel image from 16-bit to 8-bit samples. Matching the PVA latency requires six ARM cores.
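The per-pixel operation can be sketched on the CPU as follows. This is an illustrative reference only, not the LibPVA implementation: the function name is hypothetical, and saturating to 255 is an assumption about how values that still exceed 8 bits after the shift are handled.

```python
def shift_to_8bit(samples, shift):
    """Right-shift 16-bit samples and saturate the result to 8 bits."""
    if not 0 <= shift <= 16:
        raise ValueError("shift must be between 0 and 16")
    out = bytearray(len(samples))
    for i, value in enumerate(samples):
        # Mask to 16 bits, drop the low-order bits, then clamp (assumed).
        out[i] = min((value & 0xFFFF) >> shift, 255)
    return bytes(out)


# One row of 16-bit samples, shifted by 10 bits as in the benchmarks.
row = [0, 1023, 1024, 40000, 65535]
print(list(shift_to_8bit(row, 10)))  # [0, 0, 1, 39, 63]
```

Each output sample depends on a single input sample, which is why this kind of operation maps well onto wide vector units.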
Radial Lens Shading Correction
Corrects vignetting or intensity falloff from the center to the edges of an image caused by lens characteristics. It’s implemented using radial correction maps that are efficiently processed on the PVA.
Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized implementation of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. Performance measurements can also be observed in the attached graph.
Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Power consumption CPU only (W) | Power consumption CPU and PVA (W) | Power Ratio (CPU/PVA) |
---|---|---|---|---|---|---|
1280x720 | 1.56 | 0.145 | 10.75x | 8.4 | 3.69 | 2.28x |
1920x1080 | 3.5 | 0.330 | 10.6x | 7.6 | 3.61 | 2.11x |
3840x2160 | 13.8 | 1.402 | 9.84x | 7.2 | 3.57 | 2.02x |

The measurements were done with:
- 8-bit Fixed-point correction maps (including channels)
- RGB images (RGB24) - 8-bit per channel
- The ARM CPU requires ten cores to match the PVA latency.
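The idea of a radial correction map can be sketched as follows. Everything specific here is an assumption made for illustration: the quadratic gain model, the `strength` parameter, the Q2.6 layout of the 8-bit fixed-point gains, and the use of one map shared by all channels (the measurements above use maps that include the channels). It is not the LibPVA implementation.

```python
FRAC_BITS = 6  # assumed Q2.6 layout: 8-bit gain with 6 fractional bits


def build_gain_map(width, height, strength):
    """Build 8-bit fixed-point gains that grow with the squared radius."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r_max_sq = (cx * cx + cy * cy) or 1.0  # avoid dividing by zero on 1x1
    gains = []
    for y in range(height):
        row = []
        for x in range(width):
            r_sq = ((x - cx) ** 2 + (y - cy) ** 2) / r_max_sq
            gain = 1.0 + strength * r_sq  # hypothetical falloff model
            row.append(min(255, round(gain * (1 << FRAC_BITS))))
        gains.append(row)
    return gains


def correct(image, gains):
    """Multiply every channel by the pixel's gain, round, and saturate."""
    half = 1 << (FRAC_BITS - 1)
    return [
        [tuple(min(255, (c * g + half) >> FRAC_BITS) for c in pixel)
         for pixel, g in zip(img_row, gain_row)]
        for img_row, gain_row in zip(image, gains)
    ]


gains = build_gain_map(3, 3, 0.5)
flat = [[(100, 100, 100)] * 3 for _ in range(3)]
print(correct(flat, gains)[0])
# [(150, 150, 150), (125, 125, 125), (150, 150, 150)]
```

The centre pixel keeps its value (gain 1.0), while corners receive the largest boost, which compensates for the intensity falloff toward the edges.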
Colour Space Conversion (RGBA-Gray)
Transforms image data from one color space to another (e.g., RGB to YUV). It’s essential for encoding, display pipelines, and transmission where non-RGB formats are used.
These implementations showcase how RidgeRun leverages the PVA to create real-time, power-efficient vision pipelines suitable for embedded systems under tight performance constraints.
Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized version of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. In the example measurements, an RGBA to Grayscale conversion was performed. Performance measurements can also be observed in the attached graph.
Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Power consumption CPU only (W) | Power consumption PVA (W) | Power Ratio (CPU/PVA) |
---|---|---|---|---|---|---|
1280x720 | 1.36 | 0.085 | 16.0x | 10.35 | 3.97 | 2.6x |
1920x1080 | 3.05 | 0.195 | 15.6x | 10.74 | 3.84 | 2.8x |
3840x2160 | 12.1 | 0.746 | 16.2x | 10.74 | 3.61 | 2.98x |

The images involved:
- Input: RGBA32 (8-bit per channel, four channels)
- Output: Gray8 (8-bit single channel)
- The CPU requires twelve cores to closely match the PVA's latency.
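A scalar reference for this conversion might look like the sketch below. The integer weights (77, 150, 29) approximate the BT.601 luma coefficients in 8-bit fixed point; whether the benchmarked implementations use these exact weights is an assumption, and the function name is hypothetical.

```python
def rgba_to_gray(rgba: bytes) -> bytes:
    """Convert packed RGBA32 pixels to Gray8, ignoring the alpha channel."""
    if len(rgba) % 4:
        raise ValueError("RGBA32 buffers must hold 4 bytes per pixel")
    gray = bytearray(len(rgba) // 4)
    for i in range(len(gray)):
        r, g, b = rgba[4 * i], rgba[4 * i + 1], rgba[4 * i + 2]
        # Weights sum to 256, so shifting by 8 keeps the result in 0..255.
        gray[i] = (77 * r + 150 * g + 29 * b + 128) >> 8
    return bytes(gray)


pixels = bytes([255, 255, 255, 255,   # white
                0, 0, 0, 255,         # black
                255, 0, 0, 255])      # pure red
print(list(rgba_to_gray(pixels)))  # [255, 0, 77]
```

The conversion reads four bytes and writes one per pixel, so it is dominated by memory traffic rather than arithmetic.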
Colour Space Conversion (YUY2-RGBA)
Transforms image data from one color space to another (e.g., YUY2 to RGBA). It’s essential for encoding, display pipelines, and transmission where non-RGB formats are used.
Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized version of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. In the example measurements, a YUY2 - RGBA conversion was performed. Performance measurements can also be observed in the attached graph.
Resolution | Execution time CPU (ms) | Execution time VIC (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Performance Ratio (PVA/VIC) | Power consumption CPU only (W) | Power consumption VIC only (W) | Power consumption PVA (W) | Power Ratio (CPU/PVA) | Power Ratio (VIC/PVA) |
---|---|---|---|---|---|---|---|---|---|---|
1280x720 | 12.7 | 2.0 | 0.187 | 67.91x | 10.7x | 11.12 | 2.385 | 3.578 | 3.1x | 0.67x |
1920x1080 | 28.6 | 5.2 | 0.423 | 67.6x | 12.3x | 11.12 | 2.385 | 3.578 | 3.1x | 0.67x |
3840x2160 | 113.3 | 16.3 | 1.663 | 68.1x | 9.8x | 11.12 | 2.385 | 3.578 | 3.1x | 0.67x |

Resolution | Execution time CPU (ms) | Execution time VIC (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Performance Ratio (PVA/VIC) |
---|---|---|---|---|---|
1280x720 | 4.82 | 1.3 | 0.157 | 30.9x | 8.28x |
1920x1080 | 10.7 | 2.8 | 0.334 | 32.0x | 8.38x |
3840x2160 | 42.7 | 10.5 | 1.271 | 33.6x | 8.26x |

The images involved:
- Input/Output: YUY2 (YUV 4:2:2 interleaved)
- Output/Input: RGBA32 (8-bit per channel, four channels)
- The CPU requires twelve cores to match the PVA's latency. The CPU implementation used was GStreamer's videoconvert element, chosen for its popularity and relevance in ISP pipelines.
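The YUY2 to RGBA direction can be sketched as follows. YUY2 packs two pixels into four bytes (Y0, U, Y1, V), so both pixels share one chroma pair. The integer coefficients assume BT.601 limited-range video; the colorimetry actually used in the benchmarks is not stated, so treat the matrix and the function names as assumptions.

```python
def _clip(value: int) -> int:
    """Saturate an integer to the 0..255 range."""
    return max(0, min(255, value))


def yuy2_to_rgba(yuy2: bytes) -> bytes:
    """Convert packed YUY2 (Y0 U Y1 V) to RGBA32 with opaque alpha."""
    if len(yuy2) % 4:
        raise ValueError("YUY2 buffers must hold 4 bytes per pixel pair")
    out = bytearray()
    for i in range(0, len(yuy2), 4):
        y0, u, y1, v = yuy2[i:i + 4]
        d, e = u - 128, v - 128  # chroma is shared by both pixels
        for y in (y0, y1):
            c = 298 * (y - 16)  # assumed BT.601 limited-range scaling
            out += bytes((
                _clip((c + 409 * e + 128) >> 8),            # R
                _clip((c - 100 * d - 208 * e + 128) >> 8),  # G
                _clip((c + 516 * d + 128) >> 8),            # B
                255,                                        # A
            ))
    return bytes(out)


# One pixel pair: video black (Y=16) followed by video white (Y=235).
print(list(yuy2_to_rgba(bytes([16, 128, 235, 128]))))
# [0, 0, 0, 255, 255, 255, 255, 255]
```

The reverse direction applies the inverse matrix and averages or subsamples the chroma of each pixel pair.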
2D Filtering (Convolution)
Applies a 2D filter using a non-separable 5x5 kernel. It can be used for general image filtering and also showcases general 2D convolution performance.
Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized version of the algorithm, and all results are in milliseconds. Additionally, better performance can be achieved with further optimization as shown in NVIDIA's PVA Solutions implementation of the convolution. Performance measurements can also be observed in the attached graph.
The CPU implementation is based on cv::filter2D with a 5x5 non-separable kernel.
Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) |
---|---|---|---|
1280x720 | 6.84 | 0.179 | 38.2x |
1920x1080 | 15.37 | 0.378 | 40.66x |
3840x2160 | 61.64 | 1.503 | 41.1x |

The images involved:
- Input: Gray8 (16-bit single channel)
- Output: Gray8 (16-bit single channel)
- The CPU code uses OpenCV functions (cv::filter2D).
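For reference, the operation can be written naively as below. Like cv::filter2D with its default settings, the sketch computes a correlation anchored at the kernel centre and mirrors the border without repeating the edge pixel (BORDER_REFLECT_101). Unlike OpenCV, it leaves the accumulated sums unsaturated. The function names are hypothetical, and this is not the benchmarked code.

```python
def reflect101(index: int, size: int) -> int:
    """Mirror an out-of-range index without repeating the border pixel."""
    if size == 1:
        return 0
    period = 2 * (size - 1)
    index %= period
    return index if index < size else period - index


def filter2d(image, kernel):
    """Correlate a 2D image (list of rows) with a 2D kernel."""
    height, width = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ay, ax = kh // 2, kw // 2  # anchor at the kernel centre
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0
            for ky in range(kh):
                src_row = image[reflect101(y + ky - ay, height)]
                for kx in range(kw):
                    acc += kernel[ky][kx] * src_row[reflect101(x + kx - ax, width)]
            out[y][x] = acc
    return out


box = [[1] * 5 for _ in range(5)]
flat = [[7] * 6 for _ in range(4)]
print(filter2d(flat, box)[0][:3])  # [175, 175, 175]
```

A non-separable 5x5 kernel needs 25 multiply-accumulates per output pixel, which explains why this filter is considerably more expensive than the per-pixel operations above.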
Black Level Correction
Corrects sensor black level offsets by normalizing the pixel intensity baseline, ensuring true blacks and accurate dark-region details. It’s implemented using selectable offset adjustments that are efficiently processed on the PVA.
Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized implementation of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. Performance measurements can also be observed in the attached graph.
Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Power consumption CPU only (W) | Power consumption CPU and PVA (W) | Power Ratio (CPU/PVA) |
---|---|---|---|---|---|---|
1280x720 | 3.2 | 0.116 | 27.58x | 10.4 | 3.21 | 3.24x |
1920x1080 | 7.2 | 0.263 | 27.37x | 10.4 | 3.20 | 3.25x |
3840x2160 | 28.6 | 1.059 | 27.0x | 10.8 | 3.21 | 3.36x |

The measurements were done with:
- 4-bit Fixed-point correction maps (including channels)
- RGB images (RGB24) - 8-bit per channel
- The ARM CPU requires twelve cores to match the PVA latency.
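In its simplest form, black level correction subtracts an offset from each channel and clamps the result at zero. The sketch below shows that form with plain integer offsets per channel. The offsets, the function name, and the lack of any rescaling step are assumptions for illustration; the fixed-point correction maps used in the measurements are not modelled.

```python
def black_level_correct(rgb: bytes, offsets=(16, 16, 16)) -> bytes:
    """Subtract a per-channel black level from packed RGB24 data."""
    if len(rgb) % 3:
        raise ValueError("RGB24 buffers must hold 3 bytes per pixel")
    out = bytearray(len(rgb))
    for i, value in enumerate(rgb):
        # Clamp at zero so values below the black level do not wrap around.
        out[i] = max(0, value - offsets[i % 3])
    return bytes(out)


pixels = bytes([10, 16, 200, 255, 0, 17])
print(list(black_level_correct(pixels)))  # [0, 0, 184, 239, 0, 1]
```

Pipelines often follow the subtraction with a gain that stretches the remaining range back to full scale; that step is left out here.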
Access to PVA Solutions
Access to the NVIDIA PVA Solutions has allowed us to push performance boundaries significantly, guided by the insights the PVA architects provide in each example. For instance, in the RGBA-to-grayscale colour space conversion, we reduced the execution time at 3840x2160 from 8 ms to just 0.746 ms, roughly a 10x speedup. This improvement came from applying the programming techniques demonstrated in the Solutions. A gain was also observed in the 2D filter, where the execution time was halved (a 2x speedup).
Final Remarks
The PVA not only outperforms the CPU in performance per watt, but also delivers faster execution—particularly when leveraging its inherently parallel architecture. By combining speed and energy efficiency, it provides a strong advantage in embedded vision systems. Here are some remarks:
- Energy efficiency (performance per watt)
The PVA delivers significantly higher efficiency compared to conventional CPUs. Thanks to its VLIW SIMD architecture, purpose-built for computer vision tasks, it achieves high throughput with low power consumption—an essential feature for embedded applications requiring autonomy, continuous operation, and low latency.
- CPU and GPU offloading
The PVA offloads many pre- and post-processing tasks (such as filtering, remapping, pyramids, lens correction, etc.) from the CPU and GPU. This frees system resources for more demanding workloads like AI inference, improving overall system efficiency.
- Tailored for real-time embedded vision
With its programmability through an optimizing C/C++ compiler, specialized fixed-function units such as the DLUT, a low power profile, and deterministic execution, the PVA is an excellent fit for real-time, continuous vision applications in domains such as autonomous mobility, robotics, and surveillance, where both responsiveness and energy efficiency are critical.